Lab: Binary classification with decision trees

Author: J. Hickman

The breast cancer dataset is a well-studied binary classification dataset.

This copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset was downloaded from: https://goo.gl/U2Uwz2

In this lab we will use the dataset to train a decision tree model.

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

Instructions
* Read and work through all tutorial content and do all exercises below

Submission:
* You need to upload ONE document to Canvas when you are done
* (1) A PDF (or HTML) of the completed form of this notebook
* The final uploaded version should NOT have any code errors present
* All outputs must be visible in the uploaded version, including code-cell outputs, images, graphs, etc.

For reference, recall the following definition:
* Accuracy classification score. In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
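The definition above can be illustrated with a small sketch (toy labels, not the lab data): in the binary case `accuracy_score` is just the fraction of correct predictions, while in the multilabel case a sample only counts as correct if its entire predicted label set matches.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Binary case: fraction of samples predicted correctly
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75

# Multilabel case (subset accuracy): a sample counts only if the
# whole row of labels matches exactly
y_true_ml = np.array([[1, 0], [1, 1]])
y_pred_ml = np.array([[1, 0], [1, 0]])
print(accuracy_score(y_true_ml, y_pred_ml))  # only the first row matches -> 0.5
```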

4.1.0 Student information

Please provide the following information

# ## Name: Brian Kwon
# ## Date: 11/13/23
# ## Class Section: 001
# ## Lab Section: 001

Import

import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn import tree
from IPython.display import Image
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

4.1.1: Import

The following code will import the data file into a pandas data-frame

# LOAD THE DATAFRAME
from sklearn.datasets import load_breast_cancer
(x,y) = load_breast_cancer(return_X_y=True,as_frame=True)
df=pd.concat([x,y],axis=1)

# LOOK AT FIRST ROW
print(df.iloc[0])
mean radius                  17.990000
mean texture                 10.380000
mean perimeter              122.800000
mean area                  1001.000000
mean smoothness               0.118400
mean compactness              0.277600
mean concavity                0.300100
mean concave points           0.147100
mean symmetry                 0.241900
mean fractal dimension        0.078710
radius error                  1.095000
texture error                 0.905300
perimeter error               8.589000
area error                  153.400000
smoothness error              0.006399
compactness error             0.049040
concavity error               0.053730
concave points error          0.015870
symmetry error                0.030030
fractal dimension error       0.006193
worst radius                 25.380000
worst texture                17.330000
worst perimeter             184.600000
worst area                 2019.000000
worst smoothness              0.162200
worst compactness             0.665600
worst concavity               0.711900
worst concave points          0.265400
worst symmetry                0.460100
worst fractal dimension       0.118900
target                        0.000000
Name: 0, dtype: float64
# INSERT CODE TO PRINT ITS SHAPE AND COLUMN NAMES
print(df.shape)
print(df.columns)
(569, 31)
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')

4.1.2: Basic data exploration

We will be using y=“target” (output target) and all other remaining columns as our X (input feature) matrix.

Before doing analysis it is always good to “get inside” the data and see what we are working with

#INSERT CODE TO PRINT THE FOLLOWING DATA-FRAME WHICH SUMMARIZES EACH COLUMN 
dtypes = []
mins = []
means = []
maxs = []
for i in df.columns:
    dtypes.append(df[i].dtype)
    mins.append(df[i].min())
    means.append(np.mean(df[i]))
    maxs.append(df[i].max())
    
summary = pd.DataFrame({
    '':df.columns,
    'dtypes': dtypes,
    'min': mins,
    'mean': means,
    'max': maxs
})
summary.set_index("",inplace=True)
print(summary)
                          dtypes         min        mean         max
                                                                    
mean radius              float64    6.981000   14.127292    28.11000
mean texture             float64    9.710000   19.289649    39.28000
mean perimeter           float64   43.790000   91.969033   188.50000
mean area                float64  143.500000  654.889104  2501.00000
mean smoothness          float64    0.052630    0.096360     0.16340
mean compactness         float64    0.019380    0.104341     0.34540
mean concavity           float64    0.000000    0.088799     0.42680
mean concave points      float64    0.000000    0.048919     0.20120
mean symmetry            float64    0.106000    0.181162     0.30400
mean fractal dimension   float64    0.049960    0.062798     0.09744
radius error             float64    0.111500    0.405172     2.87300
texture error            float64    0.360200    1.216853     4.88500
perimeter error          float64    0.757000    2.866059    21.98000
area error               float64    6.802000   40.337079   542.20000
smoothness error         float64    0.001713    0.007041     0.03113
compactness error        float64    0.002252    0.025478     0.13540
concavity error          float64    0.000000    0.031894     0.39600
concave points error     float64    0.000000    0.011796     0.05279
symmetry error           float64    0.007882    0.020542     0.07895
fractal dimension error  float64    0.000895    0.003795     0.02984
worst radius             float64    7.930000   16.269190    36.04000
worst texture            float64   12.020000   25.677223    49.54000
worst perimeter          float64   50.410000  107.261213   251.20000
worst area               float64  185.200000  880.583128  4254.00000
worst smoothness         float64    0.071170    0.132369     0.22260
worst compactness        float64    0.027290    0.254265     1.05800
worst concavity          float64    0.000000    0.272188     1.25200
worst concave points     float64    0.000000    0.114606     0.29100
worst symmetry           float64    0.156500    0.290076     0.66380
worst fractal dimension  float64    0.055040    0.083946     0.20750
target                     int64    0.000000    0.627417     1.00000
# INSERT CODE TO EXPLORE THE CLASS BALANCE AND COUNT THE NUMBER OF SAMPLES FOR EACH TARGET (THEN PRINT THE RESULT)
print("Number of points with target=0:",sum(y==0),sum(y==0)/len(y))
print("Number of points with target=1:",sum(y==1),sum(y==1)/len(y))
Number of points with target=0: 212 0.37258347978910367
Number of points with target=1: 357 0.6274165202108963
# RUN THE FOLLOWING CODE TO SHOW THE HEAT-MAP FOR THE CORRELATION MATRIX
corr = df.corr()  # print(corr)                 # COMPUTE CORRELATION OF FEATURE MATRIX
print(corr.shape)
sns.set_theme(style="white")
f, ax = plt.subplots(figsize=(20, 20))  # Set up the matplotlib figure
cmap = sns.diverging_palette(230, 20, as_cmap=True)     # Generate a custom diverging colormap
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr,  cmap=cmap, vmin=-1, vmax=1, center=0,
        square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
(31, 31)

When the dataset is very large, the seaborn pairplot is often very slow.

However, in this case it can still be useful to look at a subset of the features

# # RUN THE FOLLOWING CODE TO GENERATE A SEABORN PAIRPLOT 
tmp=pd.concat([df.sample(n=10,axis=1),y],axis=1)
print(tmp.shape)
sns.pairplot(tmp,hue="target", diag_kind='kde')
plt.show()
(569, 11)

#### 4.1.3 Isolate inputs/output & Split data

# INSERT CODE TO MAKE DATA-FRAMES (or numpy arrays) (X,Y) WHERE Y="target" COLUMN and X="everything else"
X = df.drop("target",axis=1)
Y = df["target"]
# INSERT CODE TO PARTITION THE DATASET INTO TRAINING AND TEST SETS
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# INSERT CODE, AS A CONSISTENCY CHECK, TO PRINT THE TYPE AND SHAPE OF x_train, x_test, y_train, y_test
print(type(x_train), x_train.shape)
print(type(y_train), y_train.shape)
print(type(x_test), x_test.shape)
print(type(y_test), y_test.shape)
<class 'pandas.core.frame.DataFrame'> (455, 30)
<class 'pandas.core.series.Series'> (455,)
<class 'pandas.core.frame.DataFrame'> (114, 30)
<class 'pandas.core.series.Series'> (114,)

#### 4.1.4 Training the model

#### INSERT CODE BELOW TO TRAIN A SKLEARN DECISION TREE MODEL ON x_train,y_train 
from sklearn import tree
model = tree.DecisionTreeClassifier()
model = model.fit(x_train,y_train)

#### 4.1.5 Check the results

Evaluate the performance of the decision tree model by using the test data.

# INSERT CODE TO USE THE MODEL TO MAKE PREDICTIONS FOR THE TRAINING AND TEST SET 
yp_train = model.predict(x_train)
yp_test = model.predict(x_test)

Use the following reference to display the confusion matrix: the sklearn confusion matrix documentation will give you the code you need.

In the function below, also print the following as part of the function output:

ACCURACY: 0.9035087719298246
NEGATIVE RECALL (Y=0): 0.9574468085106383
NEGATIVE PRECISION (Y=0): 0.8333333333333334
POSITIVE RECALL (Y=1): 0.8656716417910447
POSITIVE PRECISION (Y=1): 0.9666666666666667
[[45  2]
 [ 9 58]]
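The per-class metrics above come from sklearn's `recall_score` and `precision_score` with the `pos_label` argument, which selects which class is treated as "positive" for the metric. A small sketch on toy labels (not the lab data):

```python
from sklearn.metrics import recall_score, precision_score, confusion_matrix

# Toy labels: two negatives (0) and three positives (1)
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# pos_label=0 computes recall/precision treating class 0 as positive
print(recall_score(y_true, y_pred, pos_label=0))     # 1 of 2 true zeros found -> 0.5
print(precision_score(y_true, y_pred, pos_label=0))  # 1 of 2 predicted zeros correct -> 0.5
print(recall_score(y_true, y_pred, pos_label=1))     # 2 of 3 true ones found -> 2/3
print(confusion_matrix(y_true, y_pred))              # rows = true class, cols = predicted class
```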

#INSERT CODE TO WRITE A FUNCTION def confusion_plot(y_data,y_pred) WHICH GENERATES A CONFUSION MATRIX PLOT AND PRINTS THE INFORMATION ABOVE (see link above for example)
def confusion_plot(y_data,y_pred):
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    accuracy = accuracy_score(y_data,y_pred)
    n_recall = recall_score(y_data, y_pred, pos_label=0)
    n_precision = precision_score(y_data, y_pred, pos_label=0)
    p_recall = recall_score(y_data, y_pred, pos_label=1)
    p_precision = precision_score(y_data, y_pred, pos_label=1)
    cm = confusion_matrix(y_data, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    print("ACCURACY:", accuracy)
    print("NEGATIVE RECALL (Y=0):", n_recall)
    print("NEGATIVE PRECISION (Y=0):", n_precision)
    print("POSITIVE RECALL (Y=1):", p_recall)
    print("POSITIVE PRECISION (Y=1):", p_precision)
    print(cm)
    disp.plot()
    plt.show()
# RUN THE FOLLOWING CODE TO TEST YOUR FUNCTION 
print("------TRAINING------")
confusion_plot(y_train,yp_train)
print("------TEST------")
confusion_plot(y_test,yp_test)
------TRAINING------
ACCURACY: 1.0
NEGATIVE RECALL (Y=0): 1.0
NEGATIVE PRECISION (Y=0): 1.0
POSITIVE RECALL (Y=1): 1.0
POSITIVE PRECISION (Y=1): 1.0
[[165   0]
 [  0 290]]
------TEST------
ACCURACY: 0.9122807017543859
NEGATIVE RECALL (Y=0): 0.9361702127659575
NEGATIVE PRECISION (Y=0): 0.8627450980392157
POSITIVE RECALL (Y=1): 0.8955223880597015
POSITIVE PRECISION (Y=1): 0.9523809523809523
[[44  3]
 [ 7 60]]

#### 4.1.6 Visualize the tree

# INSERT CODE TO WRITE A FUNCTION "def plot_tree(model,X,Y)" TO VISUALIZE THE DECISION TREE (see https://mljar.com/blog/visualize-decision-tree/ for an example)
def plot_tree(model,X,Y):
    fig = plt.figure(figsize=(25,20))
    _ = tree.plot_tree(model,
                   feature_names=list(X.columns),
                   class_names=np.unique(Y.astype(str)),
                   filled=True)

plot_tree(model,X,Y)

#### 4.1.7 Hyper-parameter tuning

The “max_depth” hyper-parameter lets us control the number of layers in our tree.

Let's iterate over “max_depth” and try to find the hyper-parameter value with the lowest training AND test error.
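As a standalone illustration of what “max_depth” does (a sketch on the same dataset, separate from the exercise loop): an unconstrained tree keeps splitting until its leaves are pure and essentially memorizes the training set, while `max_depth` caps the number of layers, which can be checked with `get_depth()`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Unconstrained tree: grows until leaves are pure, so training
# accuracy approaches 1.0
full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(full.get_depth(), full.score(X, y))

# Capped tree: at most 2 layers of splits, so it fits the training
# data less tightly
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(shallow.get_depth(), shallow.score(X, y))
```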

# COMPLETE THE FOLLOWING CODE TO LOOP OVER POSSIBLE HYPER-PARAMETERS VALUES
test_results=[]
train_results=[]

for num_layer in range(1,20):
    model = tree.DecisionTreeClassifier(max_depth=num_layer)
    model = model.fit(x_train,y_train)

    yp_train=model.predict(x_train)
    yp_test=model.predict(x_test)

    # print(y_pred.shape)
    test_results.append([num_layer,accuracy_score(y_test, yp_test),recall_score(y_test, yp_test,pos_label=0),recall_score(y_test, yp_test,pos_label=1)])
    train_results.append([num_layer,accuracy_score(y_train, yp_train),recall_score(y_train, yp_train,pos_label=0),recall_score(y_train, yp_train,pos_label=1)])
# INSERT CODE TO GENERATE THE THREE PLOTS BELOW (SEE EXPECTED OUTPUT FOR EXAMPLE)
# NOTE: THERE IS A TYPO IN THE THIRD PLOT, IT SHOULD BE RECALL IN THE Y-AXIS LABEL NOT ACCURACY
depths = [row[0] for row in train_results]   # use the actual max_depth values on the x-axis

plt.plot(depths, [row[1] for row in train_results], marker='o', color="blue")
plt.plot(depths, [row[1] for row in test_results], marker='o', color="red")
plt.ylabel("ACCURACY: Training (blue) and Test (red)")
plt.xlabel("Number of layers in decision tree (max depth)")
plt.tight_layout()
plt.show()

plt.plot(depths, [row[2] for row in train_results], marker='o', color="blue")
plt.plot(depths, [row[2] for row in test_results], marker='o', color="red")
plt.ylabel("RECALL (Y=0): Training (blue) and Test (red)")
plt.xlabel("Number of layers in decision tree (max depth)")
plt.tight_layout()
plt.show()

plt.plot(depths, [row[3] for row in train_results], marker='o', color="blue")
plt.plot(depths, [row[3] for row in test_results], marker='o', color="red")
plt.ylabel("RECALL (Y=1): Training (blue) and Test (red)")
plt.xlabel("Number of layers in decision tree (max depth)")
plt.tight_layout()
plt.show()

#### 4.1.8 Train optimal model

Re-train the decision tree using the optimal hyper-parameter obtained from the plot above

#### COMPLETE THE CODE BELOW TO TRAIN A SKLEARN DECISION TREE MODEL ON x_train,y_train 
from sklearn import tree
best_depth = test_results[np.argmax([row[1] for row in test_results])][0]  # depth with highest test accuracy
model = tree.DecisionTreeClassifier(max_depth=best_depth)
model = model.fit(x_train,y_train)

yp_train=model.predict(x_train)
yp_test=model.predict(x_test)
# RUN THE FOLLOWING CODE TO EVALUATE YOUR MODEL
print("------TRAINING------")
confusion_plot(y_train,yp_train)
print("------TEST------")
confusion_plot(y_test,yp_test)

plot_tree(model,x,y)
------TRAINING------
ACCURACY: 0.9516483516483516
NEGATIVE RECALL (Y=0): 0.896969696969697
NEGATIVE PRECISION (Y=0): 0.9673202614379085
POSITIVE RECALL (Y=1): 0.9827586206896551
POSITIVE PRECISION (Y=1): 0.9437086092715232
[[148  17]
 [  5 285]]
------TEST------
ACCURACY: 0.9649122807017544
NEGATIVE RECALL (Y=0): 0.9361702127659575
NEGATIVE PRECISION (Y=0): 0.9777777777777777
POSITIVE RECALL (Y=1): 0.9850746268656716
POSITIVE PRECISION (Y=1): 0.9565217391304348
[[44  3]
 [ 1 66]]